Directed acyclic graphs

Let’s assume we want to generate a data set with the following variables:

  • \(\small D\): daily reading duration of the parents
  • \(\small B\): number of books in the household
  • \(\small C\): child reading ability

Before we start simulating anything, we might want to think about the possible causal relationships between these variables. Drawing graphs is a good way to do this. Here are some possible scenarios:

library(dagitty)
library(rethinking)  # provides drawdag(), quap(), precis(), link(), etc.
par(mfrow = c(1,3))
dag1 = dagitty(
  "dag{
  D->B;
  B->C;
  D->C
  }")
coord.list = 
  list(
    x=c(D=0,B=2,C=1),
    y=c(D=0,B=0,C=1))
coordinates(dag1) = coord.list
drawdag(dag1, cex = 2)

dag2 = dagitty(
  "dag{
  D->B;
  D->C
  }")
coordinates(dag2) = coord.list
drawdag(dag2, cex = 2)

dag3 = dagitty(
  "dag{
  D->B;
  B->C
  }")
coordinates(dag3) = coord.list
drawdag(dag3, cex = 2)


These graphs are built from two types of components: nodes (the variables) and edges (the arrows connecting them). What is special about directed acyclic graphs is that

  1. the relationship between two variables has a direction, such that variables at the start of an arrow are causes and variables at the end are effects
  2. a succession of arrows must not form a cycle, i.e. a variable cannot cause itself, even indirectly1. For example, \(\small D\rightarrow B\rightarrow C\rightarrow D\) is not allowed.
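The acyclicity requirement can also be checked mechanically. Here is a minimal base R sketch (the adjacency-matrix encoding and the helper name has_cycle are ours, not part of dagitty): a directed graph with \(\small n\) nodes contains a cycle exactly when some matrix power \(\small A^k\) with \(\small k \le n\) has a nonzero diagonal entry.

```r
# Adjacency matrix for the forbidden cycle D -> B -> C -> D;
# rows are causes, columns are effects.
A <- matrix(0, 3, 3, dimnames = list(c("D","B","C"), c("D","B","C")))
A["D","B"] <- 1; A["B","C"] <- 1; A["C","D"] <- 1

# A directed graph has a cycle iff some power A^k (k <= number of nodes)
# has a nonzero diagonal entry, i.e. a path from a node back to itself.
has_cycle <- function(A) {
  Ak <- A
  for (k in seq_len(nrow(A))) {
    if (any(diag(Ak) > 0)) return(TRUE)
    Ak <- Ak %*% A
  }
  FALSE
}

has_cycle(A)      # TRUE: D -> B -> C -> D is not allowed
A["C","D"] <- 0   # dropping C -> D leaves the chain D -> B -> C
has_cycle(A)      # FALSE: this one is a valid DAG
```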

To simulate data from a DAG, we have to decide what distributions the variables have. To keep it simple, we just use normally distributed data. The order in which we simulate variables is determined by the DAG. The first thing we do is check which variables are exogenous to our model (DAG), because these have to be simulated first.

Let us focus on the left DAG, which we show here again:

drawdag(dag1, cex = 0.5, lwd = 1)

Here, \(D\) is the exogenous variable, so we simulate it first.

N = 100
D = rnorm(N)


\(\small B\) and \(\small C\) are both endogenous variables, but \(\small B\) influences \(\small C\), so we have to simulate \(\small B\) first:

B = 1*D + rnorm(N)


Finally, we can calculate \(\small C\) as an effect of \(\small D\) and \(\small B\):

C = 1*D + 0.2*B + rnorm(N)

A naive regression analysis

As a recap, here is the model for a simple linear regression:


What          Notation                               quap R-code
Likelihood    \(C_i \sim Normal(\mu_i,\sigma)\)      C ~ dnorm(mu, sigma)
Linear model  \(\mu_i = \alpha + \beta B_i\)         mu[i] <- a + bB*B[i]
Prior         \(\alpha \sim Normal(0,.5)\)           a ~ dnorm(0, .5)
Prior         \(\beta \sim Normal(0,1)\)             bB ~ dnorm(0, 1)
Prior         \(\sigma \sim Exponential(1)\)         sigma ~ dexp(1)


Here we put the data together and run the model:

# helper to display model coefficients; requires knitr, kableExtra and magrittr
library(knitr)
library(kableExtra)
library(magrittr)
coefkable = function(m, caption) {
  cap = paste("Coefficients for model", caption)
  m %>% 
    precis %>% 
    as.matrix() %>% 
    kable(caption = cap) %>% 
    kable_styling(full_width = F)
}
d = data.frame(C = C, B = B, D = D)

m.C_B <- quap(
  alist(
    C ~ dnorm(mu,sigma),
    mu <- a + bB*B,
    a ~ dnorm(0,.5),
    bB ~ dnorm(0,1),
    sigma ~ dexp(1)),
  data=d)
coefkable(m.C_B, "m.C_B")
Coefficients for model m.C_B

        mean       sd         5.5%        94.5%
a       0.0045898  0.1128595  -0.1757815  0.1849611
bB      0.7319693  0.0739242   0.6138243  0.8501144
sigma   1.1576259  0.0811570   1.0279213  1.2873305

And we plot the associations implied by prior and posterior:

par(mfrow = c(1,2))
prior = extract.prior(m.C_B)
mu = link(m.C_B,
          post=prior,
          data=list(B=c(-3,3)))
matplot(c(-3,3),
        t(mu[1:50,]),
        xlab = "B", ylab = "C", ylim = c(-4.5,4.5),
        type = "l", col = "blue", lty = 1)
title("Prior predictive samples")

B_seq = seq(from=-3,to=3,length.out=30)
mu = link(m.C_B,data=list(B=B_seq))
mu.mean <-apply(mu,2,mean)
mu.HPDI <-apply(mu,2,HPDI)
plot( C~B,data=d, cex = .5,
      col=adjustcolor("black", alpha = .2),
      ylim = c(-4.5,4.5),
      xlim = c(-3,3))
lines( B_seq,mu.mean,lwd=2, col = "blue")
shade( mu.HPDI, B_seq, col = adjustcolor("blue", alpha = .2))
title("Posterior predictive mean & HPDI")

So if we disregard our knowledge about the daily reading duration of parents, we conclude that there is a strong relationship between the number of books in a household and child reading ability. If one were willing to interpret the association as causal, one could even think that the number of books in a household is an important cause of child reading ability.

This example makes clear that we can come to wrong conclusions if our analysis model, here the simple linear regression \(\small C \sim Normal(\alpha+\beta B,\sigma)\) (the right DAG in the first figure), does not match the data generating process (or model), which is \(\small C \sim Normal(\alpha+\beta_1 B + \beta_2 D,\sigma)\) (the left DAG in the first figure).

The question then is, how can we figure out what the correct data generating process (true DAG) is? Domain knowledge is obviously useful. In addition, one can also use the fact that DAGs (can) imply conditional independencies. Two variables are conditionally independent given a third variable when the first two variables are unrelated once one knows the value of the third variable or fixes it to a specific value.

Let’s look at the left and right DAGs from above again:

par(mfrow = c(1,2))
drawdag(dag1, cex = 0.5, lwd = 1)
drawdag(dag3, cex = 0.5, lwd = 1)

For the left DAG:

  • if we condition on \(\small D\), \(\small B\) and \(\small C\) are still dependent because of \(\small B \rightarrow C\)
  • if we condition on \(\small B\), \(\small D\) and \(\small C\) are still dependent because of \(\small D \rightarrow C\)
  • if we condition on \(\small C\), \(\small D\) and \(\small B\) are still dependent because of \(\small D \rightarrow B\)

So if the left DAG is correct, \(\small D\) and \(\small C\) are still related, after we account for the relationship of \(\small B\) and \(\small C\).

On to the right DAG:

  • if we condition on \(\small D\), \(\small B\) and \(\small C\) are still dependent because of \(\small B \rightarrow C\)
  • if we condition on \(\small B\), \(\small D\) and \(\small C\) are independent
  • if we condition on \(\small C\), \(\small D\) and \(\small B\) are still dependent because of \(\small D \rightarrow B\)
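The second bullet can be verified with a quick simulation. The sketch below uses base R’s lm instead of quap and draws a large sample from the chain \(\small D \rightarrow B \rightarrow C\) (the path coefficients 1 and 0.5, and the variable names D2, B2, C2, are arbitrary choices for this illustration):

```r
set.seed(1)
n <- 1e4
D2 <- rnorm(n)
B2 <- 1 * D2 + rnorm(n)      # D -> B
C2 <- 0.5 * B2 + rnorm(n)    # B -> C (no direct D -> C arrow)

# Marginally, D and C are associated (the effect flows through B):
round(coef(lm(C2 ~ D2))["D2"], 2)       # near 0.5

# Conditioning on B blocks the path, so D adds (almost) nothing:
round(coef(lm(C2 ~ D2 + B2))["D2"], 2)  # near 0
```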

Determining conditional independencies gets harder when DAGs are more complex. In that case one can use software like the R package dagitty to determine implied conditional independencies as follows:

impliedConditionalIndependencies(dag1)
## (empty output: dag1 implies no conditional independencies)

impliedConditionalIndependencies(dag2)
## B _||_ C | D
impliedConditionalIndependencies(dag3)
## C _||_ D | B

Now that we have learned that some DGPs imply conditional independencies, how can we test them?

Remember that we defined conditional independence as independence given that we already know the third variable. One way to incorporate what we already know about a third variable is to use multiple (linear) regression.

Assuming that \(\small C\) is our outcome of interest, we can check whether, in a model that uses both \(\small D\) and \(\small B\) to predict \(\small C\), \(\small B\) remains associated with \(\small C\). If it does, this speaks in favour of dag1. If it does not, it speaks in favour of dag3.

Multiple regression

Extending the linear model

To go from a simple to a multiple linear regression, we simply add a predictor variable to the linear predictor:

What                   Notation                                        quap R-code
Simple linear model    \(\mu_i = \alpha + \beta_B B_i\)                mu[i] <- a + bB*B[i]
Multiple linear model  \(\mu_i = \alpha + \beta_B B_i + \beta_D D_i\)  mu[i] <- a + bB*B[i] + bD*D[i]

m.C_BD <-quap(
  alist(
    C ~ dnorm(mu,sigma),
    mu <- a + bD*D + bB*B,
    a ~ dnorm(0,0.2),
    bD ~ dnorm(0,1),
    bB ~ dnorm(0,1),
    sigma ~ dexp(1)),
  data=d)
coefkable(m.C_BD,"m.C_BD")
Coefficients for model m.C_BD

        mean       sd         5.5%        94.5%
a       0.1214580  0.0820478  -0.0096703  0.2525863
bD      1.1209870  0.1321768   0.9097430  1.3322310
bB      0.1457425  0.0893884   0.0028826  0.2886024
sigma   0.8827576  0.0620665   0.7835632  0.9819519

Looking at the quap model, you can see that we use the same prior parameters for bD and bB. This makes sense because both variables have the same scale (variability). We might want to change this if the variables had different scales / variabilities.
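If the scales did differ, one common remedy is to standardize the predictors first, so that a single prior such as Normal(0, 1) is reasonable for every slope. A minimal base R sketch (x_wide is a hypothetical predictor, not one of our simulated variables):

```r
x_wide <- rnorm(100, mean = 50, sd = 10)  # hypothetical predictor on a wide scale
x_std  <- as.numeric(scale(x_wide))       # center and scale to mean 0, sd 1
c(mean = mean(x_std), sd = sd(x_std))     # approximately 0 and exactly 1
```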

Because this is a fairly simple model and little model checking is needed, we next look at the regression coefficients from the first model m.C_B, which used only \(\small B\) as a predictor, and from the last model we estimated, m.C_BD, which used \(\small B\) and \(\small D\).

plot(coeftab(m.C_BD,m.C_B),par=c("bD","bB"), xlim = c(-.05,1.15))

First, we check the conditional independence that would justify using only \(\small B\) as a predictor, namely that \(\small \beta_D\) or bD is (close to) zero in the coefficients from the m.C_BD model. The figure shows that this is clearly not the case.

The other thing we can see is that the estimated association between number of books in the household and child reading ability is weaker in the model that uses an analysis consistent with the correct data generating process. This is an example of confounding bias: if we omit a common cause of our predictor and outcome from the analysis, the estimated association between predictor and outcome is biased. Also note that the regression parameters we estimated with the m.C_BD model are consistent with the values we used to generate the data above.

Mediation, direct and indirect effects

Did we now successfully estimate the effect of daily reading hours on child reading?

To answer this question, we first need to establish that there are two types of effects we can estimate:

  • direct effects, i.e. effects that go only via the direct arrow from the predictor to the outcome
  • total effects, the sum of the direct effects and the indirect effects that are “communicated” via one or more mediator variables.
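The relationship between the two can be checked numerically: in a linear system, the total effect equals the direct effect plus the product of the path coefficients along the indirect path. A base R sketch using lm on a fresh, larger sample from the same data generating process as above (variable names D3, B3, C3 are ours):

```r
set.seed(2)
n <- 1e4
D3 <- rnorm(n)
B3 <- 1 * D3 + rnorm(n)             # D -> B
C3 <- 1 * D3 + 0.2 * B3 + rnorm(n)  # D -> C (direct) and B -> C

direct   <- coef(lm(C3 ~ D3 + B3))["D3"]   # near 1.0
indirect <- coef(lm(B3 ~ D3))["D3"] *
            coef(lm(C3 ~ D3 + B3))["B3"]   # near 1.0 * 0.2
total    <- coef(lm(C3 ~ D3))["D3"]        # near 1.2

# For least squares fits this decomposition is exact:
all.equal(unname(direct + indirect), unname(total))
```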

The m.C_BD model estimates the direct effect but not the total effect, because the coefficient for \(\small D\) is calculated while adjusting for the indirect effect via \(\small B\). If we want to estimate the total effect of \(\small D\), we need to take \(\small C\) only as an effect of \(\small D\). Note that this is true for the current model. If there were another confounder, i.e. a common cause of \(\small D\) and \(\small C\), we would have to include it in the regression.

Here is the regression model:

m.C_D <-quap(
  alist(
    C ~ dnorm(mu,sigma),
    mu <- a+bD*D,
    a ~ dnorm(0,.5),
    bD ~ dnorm(0,1),
    sigma ~ dexp(1)),
  data=d)
coefkable(m.C_D,"m.C_D")
Coefficients for model m.C_D

        mean       sd         5.5%        94.5%
a       0.1644847  0.0881132   0.0236628  0.3053066
bD      1.2897205  0.0843174   1.1549650  1.4244760
sigma   0.8916258  0.0626398   0.7915153  0.9917362

Now we can plot the coefficients from the m.C_D and m.C_BD models together:

plot(coeftab(m.C_BD,m.C_D),par=c("bD"))

As expected, the total effect is larger than the direct effect.

Think before you regress, and afterwards

One important consequence of thinking systematically about the process that produced the data, and of representing and analysing it with DAGs, is that which variables belong in a regression depends on two things:

  • What is the (assumed) data generating structure?
  • What do we want to estimate?

Depending on the answers to these questions, different variables have to be included in the regression.

One common mistake, known as the Table 2 fallacy (Table 2 is often the second table in medical articles2), is to interpret all coefficients from a multiple regression model as causal effects. In most cases this is not a valid approach.

Plotting to learn

Plotting against residuals tells us what multiple regression does

We discussed earlier that in multiple regression each regression coefficient captures the effect of a variable conditional on accounting for the effects of all other variables.

This means that we only look at the effect of the variation in that variable that is independent of the variation of the other variables. If we look, for example, at the two variables \(\small B\) and \(\small D\), these are correlated:

plot(D,B)

If we run a regression like \(\small C \sim Normal(\alpha + \beta B, \sigma)\), the coefficient \(\small \beta\) will capture the effect of the complete variation of \(\small B\), which includes some variation due to \(\small D\). This becomes obvious if you remember that we simulated B = 1*D + rnorm(N).

One way to do a regression that returns the effect of \(\small B\) that is not due to \(\small D\) is to create a new variable \(\small B_R\) that does not include information that is also in \(\small D\). We can do this by regressing \(\small B \sim Normal(\alpha + \beta D, \sigma)\) and calculating \(\small B_R\) as the difference between \(\small B\) and the linear predictor (\(\small \mu_i\) or \(\small mu[i]\)).

The following figure shows observed and predicted values on the left, where the vertical lines from observed values to the regression line are the residuals. The right hand side shows the residuals plotted against \(\small D\). As expected, these are uncorrelated, which means that \(\small B_R\) does not include any information that is also in \(\small D\).

m.B_D <-quap(
  alist(
    B ~ dnorm(mu,sigma),
    mu <- a+bD*D,
    a ~ dnorm(0,.5),
    bD ~ dnorm(0,1),
    sigma ~ dexp(1)),
  data=d)

mu = colMeans(link(m.B_D,data=d))

mu.l = colMeans(link(m.B_D,data=list(D = c(-3,3))))
par(mfrow = c(1,2))
idx = sample(nrow(d),50)
plot(D[idx],B[idx], ylab = "B (estimated)", xlab = "D")
lines(c(-3,3),mu.l, col = "blue")
for (i in idx) {
  lines(rep(D[i],2), c(B[i],mu[i]), col = "blue")
}
plot(D[idx],B[idx]-mu[idx], ylab = expression(B[r]), xlab = "D")

So we can now use \(\small B_R\) as a predictor of \(\small C\) and plot the regression line:

d$Br =  B - mu
m.C_Br <-quap(
  alist(
    C ~ dnorm(mu,sigma),
    mu <- a+bBr*Br,
    a ~ dnorm(0,.5),
    bBr ~ dnorm(0,1),
    sigma ~ dexp(1)),
  data=d)
coefkable(m.C_Br,"m.C_Br") 
Coefficients for model m.C_Br

        mean       sd         5.5%        94.5%
a       0.0437931  0.1538313  -0.2020591  0.2896453
bBr     0.1429145  0.1633824  -0.1182020  0.4040311
sigma   1.6166835  0.1129496   1.4361682  1.7971988

Here are the regression lines for the model with \(\small B_R\) as a predictor (left) and with \(\small B\) as a predictor (right). As expected, the relationship is much stronger for \(\small B\). Also compare the regression coefficient for \(\small B_R\) with the regression coefficient for \(\small B\) in the model m.C_BD: they are nearly identical.

par(mfrow = c(1,2))

## Regression on residuals
Br_seq = seq(from=-3,to=3,length.out=30)
mu = link(m.C_Br,data=list(Br=Br_seq))
mu.mean <-apply(mu,2,mean)
mu.HPDI <-apply(mu,2,HPDI)
plot( C~Br,data=d, cex = .5,
      col=adjustcolor("black", alpha = .2),
      ylim = c(-4.5,4.5),
      xlim = c(-3,3))
lines( Br_seq,mu.mean,lwd=2, col = "blue")
shade( mu.HPDI, Br_seq, col = adjustcolor("blue", alpha = .2))

## Regression on original B
B_seq = seq(from=-3,to=3,length.out=30)
mu = link(m.C_B,data=list(B=B_seq))
mu.mean <-apply(mu,2,mean)
mu.HPDI <-apply(mu,2,HPDI)
plot( C~B,data=d, cex = .5,
      col=adjustcolor("black", alpha = .2),
      ylim = c(-4.5,4.5),
      xlim = c(-3,3))
lines( B_seq,mu.mean,lwd=2, col = "blue")
shade( mu.HPDI, B_seq, col = adjustcolor("blue", alpha = .2))
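The near-identity of the two coefficients is no coincidence: for ordinary least squares it holds exactly (the Frisch–Waugh–Lovell theorem). A sketch using base R’s lm and resid on freshly simulated data with the same structure (variable names D4, B4, C4 are ours):

```r
set.seed(3)
n <- 100
D4 <- rnorm(n)
B4 <- 1 * D4 + rnorm(n)
C4 <- 1 * D4 + 0.2 * B4 + rnorm(n)

Br4 <- resid(lm(B4 ~ D4))               # part of B not explained by D
b_resid <- coef(lm(C4 ~ Br4))["Br4"]    # slope from the residual regression
b_multi <- coef(lm(C4 ~ D4 + B4))["B4"] # slope on B in the multiple regression

# The two slopes coincide up to numerical precision:
all.equal(unname(b_resid), unname(b_multi))
```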

Posterior predictions tell us if the model is any good

Counterfactual plots tell us about the effect of intervention (if the causal model is true)


  1. If one has multiple measurements of the same variable over time, each measurement would be a new node↩︎

  2. Table 1 describes the sample↩︎